clear
set more off
macro drop _all
capture log close

/********************************************************************************
Discrimination in Multi-Phase Systems: Evidence from Child Protection
Match Master DHHS Data to ED Data from MCER

Created on: 2/26/2022
Last Modified on: 2/13/2024

Description: This do file matches the cleaned DHHS data to student identifiers in
	     the MCER data.  

Note that we have removed the file directory names from this program for 
confidentiality reasons. 
********************************************************************************/

** Setting the Directory
global rawdata 
global cleandata 
global tmp

/********************************************************************************

To link the DHHS data to the ED data, I take the following steps:

(1) Use the link table so that each ric maps to a single victim id (a victim id
    may still correspond to several rics)

(2) Merge the rics onto the master DHHS analysis file by victim id

*******************************************************************************/

*********************
**(1) CREATE 1:1 MATCHES BETWEEN RIC AND VICTIM ID
*********************

import delimited "$rawdata/cw-edu-link-file.csv", clear
rename childpartyid vicid

**First step is to get this file to be unique at the vicid.

**(a) Do this first by dropping an obs if two different victim ids match
**to the same ric and one is with level 2 certainty while the other is with
**level 1 certainty.
duplicates tag ric, gen(dups)
sort ric certainty
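**Sort by certainty within ric explicitly so that [1] and [_N] compare the
**lowest and highest certainty values observed for that ric.
bysort ric (certainty): gen flag=1 if certainty[1]!=certainty[_N] & dups>0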
drop if flag==1 & certainty=="Level 2"
drop dups flag certainty match_group
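
**Optional diagnostic (an addition, not part of the original program): report how
**many rics still appear on more than one row after step (a). These rows include
**the cases that step (b) resolves.
duplicates report ric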

**(b) Deal with the 19% of cases where the same ric links to more than one vicid by
**randomly choosing which of the duplicate vicids to associate with a given ric.
**Note that most of these rics are obsolete anyway and won't match to anything.
sort ric
gen flag=0
replace flag=1 if ric==ric[_n-1] & vicid!=vicid[_n-1]
bysort ric: egen flagmax=max(flag)
drop flag
rename flagmax flag
tab flag

/*
       flag |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  1,266,850       81.33       81.33
          1 |    290,860       18.67      100.00
------------+-----------------------------------
      Total |  1,557,710      100.00
*/

set seed 11223344
gen random=runiform() if flag==1
bysort ric: egen randmin=min(random)
drop if flag==1 & random!=randmin
drop random randmin flag
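
**Optional check (an addition, not part of the original program): after step (b),
**every ric should map to exactly one vicid. The check is wrapped in
**preserve/restore so that the sort it performs does not change the row order of
**the working data.
preserve
bysort ric (vicid): assert vicid[1]==vicid[_N]
restore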

**(c) Finally, reshape the data to be unique at the vicid level, keeping every ric
**that each vicid corresponds to (ric1, ric2, ...).
bysort vicid: gen n=_n
reshape wide ric, i(vicid) j(n)
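
**Optional check (an addition, not part of the original program): confirm that the
**reshaped link table has exactly one row per vicid before it is saved and merged.
**The missok option is an assumption that missing vicid values, if any, should not
**trigger an error here.
isid vicid, missok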

**Save link file
sort vicid
compress
save "$cleandata/dhhs_ed_link_table.dta", replace

***********************
***(2) MATCH VICTIM ID'S TO RICS IN THE CHILD*FIRST INVESTIGATION DHHS DATA
***********************
use "$cleandata/analysis_sample.dta", clear
sort vicid
cap drop _merge
tempfile s
save `s'

**Now merge to rics
use "$cleandata/dhhs_ed_link_table.dta", clear
sort vicid
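merge 1:m vicid using `s'
**Tabulate match results before dropping link-table rows that have no
**corresponding observation in the analysis sample
tab _merge
drop if _merge==1
drop _merge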
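
**Optional check (an addition, not part of the original program): count the
**analysis-sample rows that carry no ric after the merge. ric1 is the first of the
**wide ric variables created by the reshape in step (1c).
count if missing(ric1)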

save "$cleandata/analysis_sample_with_rics.dta", replace


























